Biomedical research very often involves data generated from repeated measures experiments. RMeDPower2 is an R package that provides complete functionality to analyse data coming from repeated measures experiments, i.e., where one has repeated measures from the same biological/independent units or samples. RMeDPower2 helps test the modeling assumptions one makes, identify outlier observations, outlier units at different levels of the design, estimates statistical power or perform sample size calculations, estimate parameters of interest and also to visualize the association being tested. The functionality is limited to testing associations of one predictor (continuous or categorical, e.g., disease status or brain pathology) along with one another covariate (e.g., gender status) in the context of hierarchical or crossed experimental designs. This vignette illustrates the use of this package in multiple contexts relevant to biomedical research.
RMeDPower2 defines the experimental design for the data, probability model of the data generating distribution and necessary parameters required for sample size calculation using convenient S4 class objects. It uses these objects in one framework that brings together the functionality implemented in multiple R packages - lme4 (implementation of linear mixed effects models), influence.ME (identification of outlier units), EnvStats (identification of outlier observations), DHARMa (testing of modeling assumptions for non-normal distributions), simr (sample-size calculations) and tidyverse (data manipulation and visualization).
RMeDPower 0.1.8
RMeDPower2RMeDPower2 can be installed using the following commands
install.packages("devtools")
library(devtools)
install_github('gladstone-institutes/RMeDPower2', build_vignettes=TRUE)
##load the library
library(RMeDPower2)
The workflow can be accomplished in 5 main steps.
Read-in the input data as a data frame
library(RMeDPower2)
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: simr
##
## Attaching package: 'simr'
## The following object is masked from 'package:lme4':
##
## getData
## Loading required package: magrittr
## Loading required package: ggplot2
template_dir <- "~/Dropbox (Gladstone)/calcPower f1000_manuscript/Final_paper_materials/Revisions/input_templates/cell_assay_data/"
data <- read.csv(paste0(template_dir, "cell_size_data.csv"), header = T)
design <- readDesign(paste0(template_dir,"design_cell_assay.json"))
model <- readProbabilityModel(paste0(template_dir,"prob_model.json"))
power_param <- readPowerParams(paste0(template_dir,"power_param.json"))
diagnose_res <- diagnoseDataModel(data, design, model)
diagnose_res_update <- diagnoseDataModel(data_update, design_update, model_update)
Run sample-size calculations
power_res <- calculatePower(data, design, model, power_param)
Estimate and visualize parameters of interest.
estimate_res <- getEstimatesOfInterest(data, design, model)
Users will have to ensure that the input data is a table with rows representing the individual observations or responses and columns representing not only the corresponding predictor/independent variables but also other variables that capture the hierarchical or crossed design of how the data was generated. It is important that the columns have names or headers - these column names will be used to define the relevant S4 class objects.
For example, below illustrates the input data from a cellular assay where each of the 2,588 observations represents a measurement of size (cell_size2) of a single cell, from a given cell line (line) that is either from a control or disease condition (classification) and is measured in a given experimental batch (experiment).
## 'data.frame': 2588 obs. of 4 variables:
## $ experiment : Factor w/ 3 levels "experiment1",..: 1 1 1 1 1 1 1 1 1 3 ...
## $ line : Factor w/ 10 levels "cellline1","cellline10",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ classification: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ cell_size2 : num 354 456 351 387 404 ...
RMeDPower2RMeDesign S4 classThe underlying design for the data will be stored in an object of this class. The slots of this class are:
response_column: character that is the column name in the input data table that captures the responses or individual observations. This has to be set by the user.
covariate: character that is the column name in the input data table that captures the covariate information. This will be used as a fixed effect in the mixed effects model of the data. This slot is set to NULL if there are no covariates.
condition_column: character that is the column name in the input data table that captures the predictor/independent variable information. This will be used as a fixed effect in the mixed effects model of the data.This has to be set by the user.
experimental_columns: character vector that represents the column names in the input data table that captures the experimental variables of the design. These will be used as random effects in the mixed effects model of the data. The order of the names in this character vector is important. The first element captures the top hierarchical level of the design, the second element the next level of the hierarchy and so on. The hierarchy is explicitly modeled in the specification of the mixed effects models. The hierarchy will not be modeled for variables specified in the crossed_columns slot (see below). This has to be set by the user.
condition_is_categorical: logical value equal to TRUE if the variable in the condition_column is to be considered as a categorical variable and is equal to FALSE otherwise. Defaults to TRUE.
crossed_columns: character vector that represents the column names in the input data table that are a subset of the values in experimental_columns representing crossed factors. Defaults to NULL.
total_column: character that is the column name in the input table that will be used to offset values in the mixed effects models. Useful for modeling count data. Defaults to NULL.
outlier_alpha: numeric value that is the p-value threshold used to identify outlier observations. Defaults to 0.05.
na_action: character equal to complete or unique. To be used in the context where the input data has multiple responses that will each be sequentially modeled. complete refers to the situation where outlier observations identified for one response will also be identified as outliers for all the other responses while unique refers to the situation where the outlier observations identified for one response will only be used for the model of that particular response. Defaults to complete.
We can create an object of class RMeDesign using the function.
design <- new("RMeDesign")
print(design)
## An object of class "RMeDesign"
## Slot "response_column":
## [1] "response_column"
##
## Slot "covariate":
## NULL
##
## Slot "condition_column":
## [1] "condition_column"
##
## Slot "condition_is_categorical":
## [1] TRUE
##
## Slot "experimental_columns":
## [1] "experimental_column"
##
## Slot "crossed_columns":
## NULL
##
## Slot "total_column":
## NULL
##
## Slot "outlier_alpha":
## [1] 0.05
##
## Slot "na_action":
## [1] "complete"
In practice, for a given data set the design can be input from a jsonfile. The user can use a text editor to modify the design_template.json file available with the package to input the relevant information based on the column names of the input data. For example, the design for the cell size data referred to above can be read using the function readDesign,
template_dir <- "~/Dropbox (Gladstone)/calcPower f1000_manuscript/Final_paper_materials/Revisions/input_templates/cell_assay_data/"
design <- readDesign(paste0(template_dir,"design_cell_assay.json"))
print(design)
## An object of class "RMeDesign"
## Slot "response_column":
## [1] "cell_size2"
##
## Slot "covariate":
## NULL
##
## Slot "condition_column":
## [1] "classification"
##
## Slot "condition_is_categorical":
## [1] TRUE
##
## Slot "experimental_columns":
## [1] "experiment" "line"
##
## Slot "crossed_columns":
## [1] "line"
##
## Slot "total_column":
## NULL
##
## Slot "outlier_alpha":
## [1] 0.05
##
## Slot "na_action":
## [1] "complete"
ProbabilityModel S4 classThe error probability distribution is specified by two slots in objects of the ProbabilityModel class.
error_is_non_normal: logical value that is TRUE if the underlying probability distribution is not assumed to be Normal and is FALSE otherwise.
family_p: character value that specifies the probability distribution. Accepted values are poisson(=poisson(link="log")), binomial(=binomial(link="logit")), bionomial_log(=binomial(link="log")), Gamma_log(=Gamma(link="log")), Gamma(=Gamma(link="inverse")), negative_binomial. Defaults to NULL.
model <- readProbabilityModel(paste0(template_dir,"prob_model.json"))
print(model)
## An object of class "ProbabilityModel"
## Slot "error_is_non_normal":
## [1] FALSE
##
## Slot "family_p":
## NULL
PowerParams S4 classThe parameters necessary for sample-size estimation are stored in the following slots of the PowerParams class.
target_columns: character vector with column names of experimental variables for which the sample-size estimation is to be performedpower_curve: numeric 1 or 0. 1: Power simulation over a range of sample sizes or levels. 0: Power calculation over a single sample size or a level. Defaults to 1.nsimn: The number of simulations to estimate power. Defaults to 10.levels: numeric 1 or 0. 1: Amplify the number of corresponding target parameter. 0: Amplify the number of samples from the corresponding target parameter, ex) If target_columns = c(“experiment”,“cell_line”) and if you want to expand the number of experiment and sample more cells from each cell line, set levels = c(1,0).max_size: Maximum levels or sample sizes to test. Default if set to NULL equals the current level or the current sample size x 5. ex) If max_levels = c(10,5), it will test upto 10 experiments and 5 cell lines.alpha: Type I error for sample-size estimation. Defaults to 0.05.breaks: Levels /sample sizes of the variable to be specified along the power curve. Default if set to NULL equals max(1, round( the number of current levels / 5 ))effect_size: If you know the effect size of your condition variable, the effect size can be provided as a parameter. Default set to NULL, that is, if the effect size is not provided, it will be estimated from your data,icc: Intra-class coefficients for each of the experimental variables. Used only in the situation when error_is_non_normal=FALSE and when the data does not allow variance component estimationpower_param <- readPowerParams(paste0(template_dir,"power_param.json"))
print(power_param)
## An object of class "PowerParams"
## Slot "target_columns":
## [1] "experiment"
##
## Slot "power_curve":
## [1] 1
##
## Slot "nsimn":
## [1] 10
##
## Slot "levels":
## [1] 1
##
## Slot "max_size":
## NULL
##
## Slot "alpha":
## [1] 0.05
##
## Slot "breaks":
## NULL
##
## Slot "effect_size":
## NULL
##
## Slot "icc":
## NULL
RMeDPower2diagnoseDataModel(data, design, model) functionTests the model assumptions using quantile-quantile, residuals vs fitted and residuals versus predicted plots, transforms responses if feasible and identifies outlier observations and also outlier experimental units
calculatePower(data, design, model, power_param) functionPerforms sample-size calculations for given experimental design/target variable
getEstimatesOfInterest(data, design, model) functionEstimates parameter of interest and also plots the resulting association with predictor of interest using residuals from the model that removes the effects of the modeled experimental variables
iPSC lines from control and patient derived iPSCs from sALS patients were obtained through the Answer ALS (AALS) consortium36 . AALS is the largest repository of ALS patient samples, and contains publicly available patient-specific iPSCs and multi-OMIC data from motor neurons derived from those iPSCs—including RNA Seq, proteomics and whole genome sequence data— all of which is matched with longitudinal clinical data (https://dataportal.answerals.org/home)36.
Figure 1: iPSC-derived ALS cell lines
We used 6 control lines and 4 sALS lines. We differentiated each line into motor neurons, transduced with a morphology marker and imaged on day ~25 using RM37-45 as previously described36. Images were processed and the soma size of neurons were captured using contour ellipse46 function adapted for use in python (https://www.python.org/).
template_dir <- "~/Dropbox (Gladstone)/calcPower f1000_manuscript/Final_paper_materials/Revisions/input_templates/cell_assay_data/"
data <- read.csv(paste0(template_dir, "cell_size_data.csv"), header = T)
design <- readDesign(paste0(template_dir,"design_cell_assay.json"))
model <- readProbabilityModel(paste0(template_dir,"prob_model.json"))
power_param <- readPowerParams(paste0(template_dir,"power_param.json"))
Diagnose the data and the model
diagnose_res <- diagnoseDataModel(data, design, model)
## boundary (singular) fit: see help('isSingular')
Check for outlier experimental batches
##see which experimental columns correspond to which variable
print(design@experimental_columns)
## [1] "experiment" "line"
print(diagnose_res$cooks_distance_experimental_column1)
## [,1]
## experiment1 0.1296877
## experiment2 0.1336167
## experiment3 8.2524792
It looks like experiment3 batch seems like an outlier. Now, check for outlier cell lines
print(diagnose_res$cooks_distance_experimental_column2)
## [,1]
## cellline1 0.068029047
## cellline10 0.013060506
## cellline2 0.002640865
## cellline3 0.310579097
## cellline4 0.048632666
## cellline5 0.005197939
## cellline6 0.002647991
## cellline7 0.009832710
## cellline8 0.005479500
## cellline9 0.890608616
The qq plots of the residuals resulting from log transformation of the cell-size responses and the removal of outliers seems to deviate least from normality (though they still do seem non-normal). In any case, we will use these data for parameter estimation.
design_update <- design
data_update <- diagnose_res$Data_updated
print(colnames(data_update))
## [1] "cell_size2_logTransformed_noOutlier" "cell_size2_logTransformed"
## [3] "cell_size2_noOutlier" "experiment"
## [5] "line" "classification"
## [7] "cell_size1" "cell_size2"
## [9] "cell_size3" "cell_size4"
## [11] "cell_size5" "cell_size6"
design_update@response_column <- "cell_size2_logTransformed_noOutlier"
estimate_res <- getEstimatesOfInterest(data_update, design_update, model)
##print model summary output
print(estimate_res[[1]])
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: response_column ~ condition_column + (1 | experimental_column1) +
## (1 | experimental_column2)
## Data: Data
##
## REML criterion at convergence: -3072
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5014 -0.6566 -0.1274 0.5318 3.7581
##
## Random effects:
## Groups Name Variance Std.Dev.
## experimental_column2 (Intercept) 0.0002444 0.01563
## experimental_column1 (Intercept) 0.0028717 0.05359
## Residual 0.0173003 0.13153
## Number of obs: 2549, groups: experimental_column2, 10; experimental_column1, 3
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 5.78400 0.03189 2.14285 181.376 1.59e-05 ***
## condition_column1 0.03682 0.01220 8.08924 3.018 0.0164 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## cndtn_clmn1 -0.155
We can also perform calculate the number of experimental batches needed to estimate the observed differences in soma size.
power_res <- calculatePower(data_update, design_update, model, power_param)
We will use data derived from the single-nuclei RNA-seq experiments performed in Koutsodendris et al. paper. One of the goals in these experiments was to assess the effect of the APOE gene isoform on brain pathology in a mouse model of Alzheimer’s Disease (AD). Specifically, APOE4 gene isoform is an known risk factor for this disease in humans compared to those with the APOE3 gene isoform. The paper tests the role of expression of neuronal APOE expression in the development of disease pathology. The design of the data involved order 1000 cells derived from the hippocampus brain region of 4 mice with the human APOE4 gene knocked-in the mouse Apoe locus and 4 other mice (called E4-KI-Syn1-Cre) with the human APOE4 gene also knocked-in in the same mouse. However, the APOE4 expression in these mice is knocked out specifically in the neurons. The single cell data allowed the identification of multiple clusters or cell-types including cluster 7 representing a group of excitatory neurons consisted of nuclei.
Figure 2: single nuclei RNA sequencing
We can first ask the question whether or not the proportion of nuclei derived from the APOE4-KI mice are different in cluster 7 compared to those derived from the APOE4-KI-Syn1-Cre mice. More exactly, we want to test whether or not there is a difference in the log odds of cluster membership in cluster 7 among the APOE4-KI mice versus the APOE4-KI-Syn1-Cre. The data consists of the number of nuclei per animal present in the cluster and also the total number of nuclei that were isolated from this animal.
template_dir <- "~/Dropbox (Gladstone)/calcPower f1000_manuscript/Final_paper_materials/Revisions/input_templates/single_cell_data/celltype_proportions_with_genotype/"
data <- read.csv(paste0(template_dir, "apoe_associated_cell_proportions.csv"), header = T)
head(data)
## sample_id animal_model total_numbers_of_cells_per_sample cluster_id
## 1 S10 PS19-fE4 6552 7
## 2 S11 PS19-fE4 6214 7
## 3 S12 PS19-fE4 Syn1-Cre 7873 7
## 4 S13 PS19-fE4 Syn1-Cre 6577 7
## 5 S4 PS19-fE4 16223 7
## 6 S5 PS19-fE4 12357 7
## number_of_cells_per_sample_in_cluster Genotype Mouse_. Sex PS19 Cre
## 1 263 PS19-fE4 154 F + -
## 2 8 PS19-fE4 413 M + -
## 3 58 PS19-fE4_Syn1-Cre 305 M + +_-
## 4 1 PS19-fE4_Syn1-Cre 504 F + +_-
## 5 2139 PS19-fE4 129 F + -
## 6 1811 PS19-fE4 156 M + -
## ApoE DOB Age_Perfused Date_Perfused Date_of_Nuc_Isolation
## 1 E4_4 10_30_19 10.3 9/7/20 9/9/21
## 2 E4_4 10_2_20 10.1 8/4/21 9/9/21
## 3 E4_4 3_19_20 10.5 1/8/21 9/9/21
## 4 E4_4 9_8_20 10.8 8/4/21 9/9/21
## 5 E4_4 12_18_19 9.9 10/16/20 9/7/21
## 6 E4_4 12_18_19 9.9 10/16/20 9/7/21
## Hippocampus_Vol_mm X._AT8_Coverage_Area X._GFAP_Coverage_Area
## 1 6.75 31.52 31.75
## 2 8.88 4.43 4.98
## 3 9.26 15.43 5.60
## 4 7.58 6.92 11.90
## 5 3.73 69.89 38.28
## 6 3.62 88.85 39.84
## X._S100B_Coverage_Area X._IBA1_Coverage_Area X._CD68_Coverage_Area
## 1 6.94 20.20 10.45
## 2 1.02 10.09 1.58
## 3 0.38 1.17 0.42
## 4 5.05 13.09 3.49
## 5 9.71 30.51 25.49
## 6 18.11 31.49 19.85
## X._MBP_Coverage_Area X._OPC_Coverage_Area
## 1 9.43 10.30
## 2 4.32 6.04
## 3 14.08 24.32
## 4 10.21 21.60
## 5 13.28 11.68
## 6 7.72 10.95
##load the design
design <- readDesign(paste0(template_dir,"design_celltype1.json"))
print(design)
## An object of class "RMeDesign"
## Slot "response_column":
## [1] "number_of_cells_per_sample_in_cluster"
##
## Slot "covariate":
## [1] "Sex"
##
## Slot "condition_column":
## [1] "Genotype"
##
## Slot "condition_is_categorical":
## [1] TRUE
##
## Slot "experimental_columns":
## [1] "sample_id"
##
## Slot "crossed_columns":
## NULL
##
## Slot "total_column":
## [1] "total_numbers_of_cells_per_sample"
##
## Slot "outlier_alpha":
## [1] 0.05
##
## Slot "na_action":
## [1] "complete"
##load the probability model
model <- readProbabilityModel(paste0(template_dir,"prob_model.json"))
print(model)
## An object of class "ProbabilityModel"
## Slot "error_is_non_normal":
## [1] TRUE
##
## Slot "family_p":
## [1] "binomial"
##load the relevant power parameters
power_param <- readPowerParams(paste0(template_dir,"power_param.json"))
print(power_param)
## An object of class "PowerParams"
## Slot "target_columns":
## [1] "sample_id"
##
## Slot "power_curve":
## [1] 1
##
## Slot "nsimn":
## [1] 10
##
## Slot "levels":
## [1] 1
##
## Slot "max_size":
## NULL
##
## Slot "alpha":
## [1] 0.05
##
## Slot "breaks":
## NULL
##
## Slot "effect_size":
## NULL
##
## Slot "icc":
## NULL
Now, we will diagnose the data and the model.
diagnose_res <- diagnoseDataModel(data, design, model)
## [1] "No outlier detected from the raw Data"
We can now estimate the log odds ratio of interest.
estimate_res <- getEstimatesOfInterest(data, design, model)
print(estimate_res[[1]])
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## cbind(response_column, (total_column - response_column)) ~ condition_column +
## covariate + (1 | experimental_column1)
## Data: Data
##
## AIC BIC logLik deviance df.resid
## 100.9 101.2 -46.4 92.9 4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -0.41164 -0.28357 0.00219 0.02755 0.09293
##
## Random effects:
## Groups Name Variance Std.Dev.
## experimental_column1 (Intercept) 3.464 1.861
## Number of obs: 8, groups: experimental_column1, 8
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0029 1.1493 -2.613 0.00898 **
## condition_columnPS19-fE4_Syn1-Cre -3.6302 1.3514 -2.686 0.00723 **
## covariateM -0.7105 1.3482 -0.527 0.59821
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) c_PS19
## c_PS19-E4_S -0.560
## covariateM -0.582 -0.003
And estimate whether or not statistical power we have to estimate the given observed association with varying number of mice per group.
print(power_param@target_columns)
## [1] "sample_id"
power_res <- calculatePower(data, design, model, power_param)
## boundary (singular) fit: see help('isSingular')
Next, we can first ask the question whether or not the proportion of nuclei derived from each of the mice are associated with their hippocampus volumes. Note, more neuro-degeneration is linked to smaller hippocampus volumes. More exactly, we want to test whether or not there is a change in the log odds of cluster membership in cluster 7 corresponding to a unit change in the hippocampus volume. The above data provided includes the hippocampus volume for each mouse.
##Associations of cell-type proportions with hippocampus volume
design@condition_column = "Hippocampus_Vol_mm"
design@condition_is_categorical = FALSE
design@experimental_columns = c("animal_model", "sample_id")
diagnose_res <- diagnoseDataModel(data, design, model)
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
We can also estimate the parameter of interest - log odds ratio of cluster 7 membership per unit increase in hippocampus volume of associated mouse.
estimate_res <- getEstimatesOfInterest(data, design, model)
## boundary (singular) fit: see help('isSingular')
## `geom_smooth()` using formula = 'y ~ x'
And calculate the statistical power to estimate this association for different numbers of animals per group.
power_res <- calculatePower(data, design, model, power_param)
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
We will now estimate the association of the expression levels of the Snhg11 gene with genotype among cells in cluster 7 in the same data set. We will now load the raw counts for this gene and create the relevant RMeDesign, ProbabilityModel and PowerParams objects.
template_dir <- "~/Dropbox (Gladstone)/calcPower f1000_manuscript/Final_paper_materials/Revisions/input_templates/single_cell_data/gene_expression_with_genotype/"
data <- read.csv(paste0(template_dir, "raw_counts_Snhg11_gene_cluster_7_snRNAseq_Koutsodendris_et_al_2023.csv"), header = T)
# data %<>% mutate(genotype = gsub(" ", "-", genotype))
# data %>% write.csv(., paste0(template_dir, "raw_counts_Snhg11_gene_cluster_7_snRNAseq_Koutsodendris_et_al_2023.csv"), row.names = F)
head(data)
## y orig.ident nCount_RNA nFeature_RNA percent.mt sample_number
## 1 20 SeuratProject 2878 1716 0.03474635 S10
## 2 6 SeuratProject 9350 3682 0.05347594 S10
## 3 22 SeuratProject 2949 1735 0.03390980 S10
## 4 12 SeuratProject 4529 2494 0.04415986 S10
## 5 14 SeuratProject 3044 1866 0.03285151 S10
## 6 9 SeuratProject 10636 3957 0.03760812 S10
## mouse_number genotype sex Cre ApoE dob age_perfused date_perfused
## 1 154 PS19-fE4 F - E4/4 10/30/19 10.3 9/7/20
## 2 154 PS19-fE4 F - E4/4 10/30/19 10.3 9/7/20
## 3 154 PS19-fE4 F - E4/4 10/30/19 10.3 9/7/20
## 4 154 PS19-fE4 F - E4/4 10/30/19 10.3 9/7/20
## 5 154 PS19-fE4 F - E4/4 10/30/19 10.3 9/7/20
## 6 154 PS19-fE4 F - E4/4 10/30/19 10.3 9/7/20
## date_of_nuclear_isolation nCount_SCT nFeature_SCT SCT_snn_res.0.7
## 1 9/9/21 2762 1716 6
## 2 9/9/21 2910 2040 6
## 3 9/9/21 2813 1735 6
## 4 9/9/21 3253 2414 6
## 5 9/9/21 2877 1866 6
## 6 9/9/21 2675 1843 6
## seurat_clusters orig_seurat_clusters y.samples.lib.size library_size
## 1 7 6 2762 2762
## 2 7 6 2910 2910
## 3 7 6 2813 2813
## 4 7 6 3253 3253
## 5 7 6 2877 2877
## 6 7 6 2675 2675
## constant_library_size
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
design <- readDesign(paste0(template_dir,"design_gene_expression.json"))
print(design)
## An object of class "RMeDesign"
## Slot "response_column":
## [1] "y"
##
## Slot "covariate":
## [1] "sex"
##
## Slot "condition_column":
## [1] "genotype"
##
## Slot "condition_is_categorical":
## [1] TRUE
##
## Slot "experimental_columns":
## [1] "sample_number"
##
## Slot "crossed_columns":
## NULL
##
## Slot "total_column":
## [1] "library_size"
##
## Slot "outlier_alpha":
## [1] 0.05
##
## Slot "na_action":
## [1] "complete"
model <- readProbabilityModel(paste0(template_dir,"prob_model.json"))
print(model)
## An object of class "ProbabilityModel"
## Slot "error_is_non_normal":
## [1] TRUE
##
## Slot "family_p":
## [1] "negative_binomial"
power_param <- readPowerParams(paste0(template_dir,"power_param.json"))
print(power_param)
## An object of class "PowerParams"
## Slot "target_columns":
## [1] "sample_number"
##
## Slot "power_curve":
## [1] 1
##
## Slot "nsimn":
## [1] 10
##
## Slot "levels":
## [1] 1
##
## Slot "max_size":
## NULL
##
## Slot "alpha":
## [1] 0.05
##
## Slot "breaks":
## NULL
##
## Slot "effect_size":
## NULL
##
## Slot "icc":
## NULL
Note, we will assume the underlying probability distribution for the counts is negative binomial and will control for differences in the sequencing depths or the total reads each cell receives using library_size.
Diagnosis of the data and the model
diagnose_res <- diagnoseDataModel(data, design, model)
Estimate log-fold change of the Snhg11 gene with genotype.
estimate_res <- getEstimatesOfInterest(data, design, model)
print(estimate_res[[1]])
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: Negative Binomial(4.914) ( log )
## Formula:
## response_column ~ condition_column + covariate + (1 | experimental_column1) +
## offset(log(total_column))
## Data: Data
##
## AIC BIC logLik deviance df.resid
## 28790.0 28821.9 -14390.0 28780.0 4297
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8485 -0.7485 -0.1820 0.5882 5.4549
##
## Random effects:
## Groups Name Variance Std.Dev.
## experimental_column1 (Intercept) 0.07378 0.2716
## Number of obs: 4302, groups: experimental_column1, 8
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.1736 0.1756 -29.467 <2e-16 ***
## condition_columnPS19-fE4-Syn1-Cre 0.4622 0.2280 2.028 0.0426 *
## covariateM 0.1356 0.2335 0.581 0.5615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) c_PS19
## c_PS19-E4-S -0.450
## covariateM -0.588 -0.076
We can also perform a power analyses to detect a log(2) fold-change in this gene.
power_param@effect_size <- log(2)
power_res <- calculatePower(data, design, model, power_param)
Morris Water Maze (MWM) is an assay used to test spatial learning in animal models of diseases like Alzheimer’s disease. The time recorded by each mouse to reach a target region (in the MWM) over multiple trials denoted as latency is the response of interest. Mouse models deficient in spatial learning would demonstrate an increased latency to reach the target region across the learning trials.
Figure 3: Morris Water Maze
template_dir <- "~/Dropbox (Gladstone)/calcPower f1000_manuscript/Final_paper_materials/Revisions/input_templates/behavior_data/"
data <- read.csv(paste0(template_dir, "MWM_data.csv"), header = T)
design <- readDesign(paste0(template_dir,"design_behavior.json"))
model <- readProbabilityModel(paste0(template_dir,"prob_model.json"))
power_param <- readPowerParams(paste0(template_dir,"power_param.json"))
diagnose_res <- diagnoseDataModel(data, design, model)
## [1] "No outlier detected from the raw Data"
Based on the above plots, it seems like the log transformed versions of the latency responses better fit the model assumptions. Therefore using these transformed responses we will estimate and visualize our parameter of interest.
data_update <- diagnose_res$Data_updated
design_update <- design
design_update@response_column %<>% paste0(., "_logTransformed")
estimate_res <- getEstimatesOfInterest(data_update, design_update, model)
print(estimate_res[[1]])
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## response_column ~ condition_column + covariate + (1 | experimental_column1) +
## (1 | experimental_column2)
## Data: Data
##
## REML criterion at convergence: 3464.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9587 -0.6407 0.1179 0.7752 2.0354
##
## Random effects:
## Groups Name Variance Std.Dev.
## experimental_column2 (Intercept) 0.05842 0.2417
## experimental_column1 (Intercept) 0.06049 0.2460
## Residual 0.82664 0.9092
## Number of obs: 1272, groups:
## experimental_column2, 106; experimental_column1, 3
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 3.40673 0.17276 3.84608 19.720 5.24e-05 ***
## condition_column1 0.37837 0.06975 102.13381 5.424 3.92e-07 ***
## covariateLATENCY_10 -0.46921 0.12489 1155.00013 -3.757 0.000181 ***
## covariateLATENCY_11 -0.64563 0.12489 1155.00013 -5.170 2.76e-07 ***
## covariateLATENCY_12 -0.69739 0.12489 1155.00013 -5.584 2.93e-08 ***
## covariateLATENCY_2 -0.07767 0.12489 1155.00013 -0.622 0.534093
## covariateLATENCY_3 -0.15209 0.12489 1155.00013 -1.218 0.223531
## covariateLATENCY_4 -0.41716 0.12489 1155.00013 -3.340 0.000864 ***
## covariateLATENCY_5 -0.27506 0.12489 1155.00013 -2.202 0.027828 *
## covariateLATENCY_6 -0.37621 0.12489 1155.00013 -3.012 0.002648 **
## covariateLATENCY_7 -0.32589 0.12489 1155.00013 -2.609 0.009185 **
## covariateLATENCY_8 -0.48713 0.12489 1155.00013 -3.901 0.000102 ***
## covariateLATENCY_9 -0.62461 0.12489 1155.00013 -5.001 6.57e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation matrix not shown by default, as p = 13 > 12.
## Use print(estimate_res[[1]], correlation=TRUE) or
## vcov(estimate_res[[1]]) if you need it
We will also perform sample-size calculations for the number of mice needed per cohort.
power_param@target_columns <- "MouseID"
power_res <- calculatePower(data_update, design_update, model, power_param)
Patch clamp is a technique that can be used to measure membrane action potential in excitable cells like neurons. We will work with (simulated) data from two sets of mice - one representing a Alzheimer’s Disease model (hAPP) and the other representing control or wild-type mice. In each of these mice, multiple brain tissue slices are isolated and the membrane action potential for at least one neuronal cell in each slice is recorded.
Figure 4: Patch clamp assay
template_dir <- "~/Dropbox (Gladstone)/calcPower f1000_manuscript/Final_paper_materials/Revisions/input_templates/electrical_patch_clamp_data/"
data <- read.csv(paste0(template_dir, "patch_clamp_data.csv"), header = T)
head(data)
## negative_action_potential_mV Mice Genotype Sex slice_number
## 1 56.23547 Mouse_1 hAPP Male 1
## 2 47.89597 Mouse_3 hAPP Male 1
## 3 56.20910 Mouse_3 hAPP Male 2
## 4 53.14394 Mouse_9 hAPP Female 1
## 5 53.67793 Mouse_9 hAPP Female 1
## 6 61.21780 Mouse_9 hAPP Female 1
design <- readDesign(paste0(template_dir,"design_patch_clamp.json"))
print(design)
## An object of class "RMeDesign"
## Slot "response_column":
## [1] "negative_action_potential_mV"
##
## Slot "covariate":
## NULL
##
## Slot "condition_column":
## [1] "Genotype"
##
## Slot "condition_is_categorical":
## [1] TRUE
##
## Slot "experimental_columns":
## [1] "Mice" "slice_number"
##
## Slot "crossed_columns":
## NULL
##
## Slot "total_column":
## NULL
##
## Slot "outlier_alpha":
## [1] 0.05
##
## Slot "na_action":
## [1] "complete"
model <- readProbabilityModel(paste0(template_dir,"prob_model.json"))
print(model)
## An object of class "ProbabilityModel"
## Slot "error_is_non_normal":
## [1] FALSE
##
## Slot "family_p":
## NULL
power_param <- readPowerParams(paste0(template_dir,"power_param.json"))
print(power_param)
## An object of class "PowerParams"
## Slot "target_columns":
## [1] "Mice"
##
## Slot "power_curve":
## [1] 1
##
## Slot "nsimn":
## [1] 10
##
## Slot "levels":
## [1] 1
##
## Slot "max_size":
## NULL
##
## Slot "alpha":
## [1] 0.05
##
## Slot "breaks":
## NULL
##
## Slot "effect_size":
## NULL
##
## Slot "icc":
## NULL
We will now evaluate the model.
diagnose_res <-diagnoseDataModel(data, design, model)
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
It does not look like the log transformation resulted significantly changed how well the model assumptions were being met. We will estimate the association of genotype using the responses in their natural scale.
estimate_res <- getEstimatesOfInterest(data, design, model)
print(estimate_res[[1]])
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: response_column ~ condition_column + (1 | experimental_column1) +
## (1 | experimental_column2)
## Data: Data
##
## REML criterion at convergence: 429.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.3970 -0.6443 -0.1594 0.4738 3.2670
##
## Random effects:
## Groups Name Variance Std.Dev.
## experimental_column2 (Intercept) 10.646 3.263
## experimental_column1 (Intercept) 5.427 2.330
## Residual 56.462 7.514
## Number of obs: 62, groups: experimental_column2, 20; experimental_column1, 9
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 60.7603 1.9892 6.0708 30.545 7.04e-08 ***
## condition_columnWT 0.1541 2.9860 6.4725 0.052 0.96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## cndtn_clmWT -0.666
This is not significant at all. How many mice do we need to have 80% power to estimate the observed difference?
power_param@target_columns <- "Mice"
power_param@effect_size <- 7
power_res <- calculatePower(data, design, model, power_param)
power_param@target_columns <- "slice_number"
power_res <- calculatePower(data, design, model, power_param)
RMeDPower2 is designed to simulate the effect of variability of the responses which could come from differences in the processing batch, plates or cell lines on the responses of interest the cell assay data (described above for exampel). There are several ways to assess the impact of different choices for the experimental variables in this package as outlined below.
First, a user can assess how increasing the number of independent experiments affects power. For example, if a user has a pilot dataset that consists of 3 experimental batches that each contain 2 plates, expanding this dataset two-fold would have the effect of simulating 6 experiments with a total of 12 plates 5.
Figure 5: Some cool caption
Alternatively, a user can assess how increasing the number of plates per experiment affects power. In the case where there are two experiments that each contain two plates, the user can double the number of plates used per experiment. In this way, the user can simulate how the statistical power changes as a result of increasing the number of plates used per experiment rather than increasing the number of experimental batches 6.
Figure 6: Some cool caption
In a third example, a user can examine the effect of expanding the number of cell lines within each experiment affects power 7. This would capture the effect of increasing the number of cells assayed as a consequence of having more cell lines. This type of variable expansion can be accomplished by setting ‘level=1’ in the calculate_power function.
Figure 7: Some cool caption
Tables 1-3, in addition to the figures, show the change in cell number after variable expansion in experiment, plate or cell line, especially when the cell distribution is asymmetric. A user may want to examine the power of increasing the total number of cells measured from each experimental variable per experiment. For example, if there are 12 cells per cell line on plate 1, doubling the number of cells from each plate will result in assessing the effect in twice the number of cells per cell line, even if the number of experiments and cell lines remain the same 8.
Figure 8: Some cool caption
Alternatively, one might want to assess the effect of increasing the total number of cells by increasing the number of cells per cell line 9.
Figure 9: Some cool caption
Expansion results for plates and cell lines may differ if cells are asymmetrically distributed in different settings. Tables 4 and 5 show the different results users can get in an asymmetric distributed cell scenario. This type of variable expansion can be accomplished by setting ‘level=0’ in the calculate_power function.
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RMeDPower2_0.1.0 ggplot2_3.4.2 magrittr_2.0.3 simr_1.0.7
## [5] dplyr_1.1.2 lme4_1.1-33 Matrix_1.5-4 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 RLRsim_3.1-8 DHARMa_0.4.6
## [4] farver_2.1.1 fastmap_1.1.1 promises_1.2.0.1
## [7] digest_0.6.31 mime_0.12 lifecycle_1.0.3
## [10] ellipsis_0.3.2 compiler_4.3.0 rlang_1.1.1
## [13] sass_0.4.5 tools_4.3.0 plotrix_3.8-2
## [16] utf8_1.2.3 yaml_2.3.7 knitr_1.42
## [19] labeling_0.4.2 plyr_1.8.8 gap.datasets_0.0.5
## [22] abind_1.4-5 withr_2.5.0 purrr_1.0.1
## [25] numDeriv_2016.8-1.1 grid_4.3.0 fansi_1.0.4
## [28] xtable_1.8-4 colorspace_2.1-0 scales_1.2.1
## [31] iterators_1.0.14 MASS_7.3-59 cli_3.6.1
## [34] rmarkdown_2.21 generics_0.1.3 rstudioapi_0.14
## [37] binom_1.1-1.1 influence.ME_0.9-9 minqa_1.2.5
## [40] cachem_1.0.8 stringr_1.5.0 splines_4.3.0
## [43] parallel_4.3.0 BiocManager_1.30.20 vctrs_0.6.2
## [46] boot_1.3-28.1 jsonlite_1.8.7 carData_3.0-5
## [49] bookdown_0.34 car_3.1-2 pbkrtest_0.5.2
## [52] qgam_1.3.4 magick_2.7.5 foreach_1.5.2
## [55] tidyr_1.3.0 jquerylib_0.1.4 gap_1.5-1
## [58] glue_1.6.2 nloptr_2.0.3 codetools_0.2-19
## [61] stringi_1.7.12 gtable_0.3.3 later_1.3.0
## [64] EnvStats_2.7.0 lmerTest_3.1-3 munsell_0.5.0
## [67] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.5
## [70] R6_2.5.1 doParallel_1.0.17 evaluate_0.20
## [73] shiny_1.7.4 lattice_0.21-8 highr_0.10
## [76] backports_1.4.1 broom_1.0.4 httpuv_1.6.9
## [79] bslib_0.4.2 Rcpp_1.0.10 nlme_3.1-162
## [82] mgcv_1.8-42 xfun_0.39 pkgconfig_2.0.3